Site Scanning & IP grabs
[You ("Brian W. Spolarich")]
> Good spiders will ask for /robots.txt and find out what to do with
>themselves if they find it.
>
> Generally grepping for /robots.txt will give you a list of spiders that
>have found you.
Very true. In fact, on my server I've ScriptAliased /robots.txt to
the following little Perl script. This lets me grab a little more information
from the robot than the server logs by default, namely the HTTP_FROM
address it advertises.
--------------------------code snippet
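#!/usr/bin/perl
# Serve a permissive robots.txt and log who asked for it.
$Log = '/var/adm/httpd_robots';
# CGI environment variables worth recording for each request.
@Interesting = ('HTTP_USER_AGENT', 'REMOTE_ADDR', 'REMOTE_HOST', 'HTTP_FROM');

# Answer the robot first: an empty Disallow permits everything.
print "Content-type: text/plain\n\n";
print "User-agent: *\nDisallow:\n\n";

# Append one tab-separated line per request.
open(LOG, ">>$Log") || die("Can't open $Log: $!\n");
print LOG '[' . localtime() . ']';
foreach $env (@Interesting) {
    print LOG "\t$env=$ENV{$env}";
}
print LOG "\n";
close LOG;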
--------------------------end code snippet
Some of the lines produced by this (I've wrapped long lines with '\'):
[Thu Feb 8 00:44:51 1996] HTTP_USER_AGENT=Scoutget 1.0 REMOTE_ADDR=206.\
101.96.35 REMOTE_HOST=seventeen.srv.lycos.com HTTP_FROM=
[Thu Feb 8 01:48:31 1996] HTTP_USER_AGENT=OTI_Spider/OTWR:002p116 libwww/\
2.17 REMOTE_ADDR=205.216.146.179 REMOTE_HOST=205.216.146.179 HTTP_FRO\
M=gregf@opentext.com
[Thu Feb 8 15:29:17 1996] HTTP_USER_AGENT=OTI_Spider/OTWR:002p116 libwww/\
2.17 REMOTE_ADDR=205.216.146.179 REMOTE_HOST=dialup-a.mv.opentext.com\
HTTP_FROM=gregf@opentext.com
[Sun Feb 11 03:00:29 1996] HTTP_USER_AGENT=CERN-LineMode/2.15 libwww/2.17\
REMOTE_ADDR=199.107.235.42 REMOTE_HOST=199.107.235.42 HTTP_FROM=vic@ap\
ollo.alphaspace.com
Interestingly, it seems that Lycos doesn't send a From header, so HTTP_FROM comes up empty.
Odd.
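
If you want to pull the same observation out of a longer log, something like the following might do. It is only a sketch: it assumes the fields are tab-separated exactly as the script above writes them, and the sub names parse_robot_line and agents_without_from are made up for illustration. The two sample entries are taken from the lines quoted above, unwrapped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one line written by the robots.txt logger.
# Assumed layout: "[timestamp]\tNAME=value\tNAME=value..."
sub parse_robot_line {
    my ($line) = @_;
    chomp $line;
    my ($stamp, @pairs) = split /\t/, $line;
    return unless defined $stamp && $stamp =~ /^\[(.*)\]$/;
    my %rec = (time => $1);
    foreach my $pair (@pairs) {
        # Split on the first '=' only; values may contain '=' themselves.
        my ($name, $value) = split /=/, $pair, 2;
        $rec{$name} = defined $value ? $value : '';
    }
    return \%rec;
}

# Return the user agents that sent no From header, sorted, no duplicates.
sub agents_without_from {
    my (@lines) = @_;
    my %seen;
    foreach my $line (@lines) {
        my $rec = parse_robot_line($line) or next;
        next if length($rec->{HTTP_FROM} // '');
        $seen{ $rec->{HTTP_USER_AGENT} // '(none)' } = 1;
    }
    return sort keys %seen;
}

# Two of the sample entries, unwrapped, with the tabs written out.
my @sample = (
    "[Thu Feb 8 00:44:51 1996]\tHTTP_USER_AGENT=Scoutget 1.0\tREMOTE_ADDR=206.101.96.35\tREMOTE_HOST=seventeen.srv.lycos.com\tHTTP_FROM=\n",
    "[Thu Feb 8 01:48:31 1996]\tHTTP_USER_AGENT=OTI_Spider/OTWR:002p116 libwww/2.17\tREMOTE_ADDR=205.216.146.179\tREMOTE_HOST=205.216.146.179\tHTTP_FROM=gregf\@opentext.com\n",
);
print "$_\n" for agents_without_from(@sample);
```

Run against the real log, you would read the lines from /var/adm/httpd_robots instead of @sample. Splitting on tabs rather than whitespace matters here, since User-Agent values such as "Scoutget 1.0" contain spaces.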
.....A. P. Harris...apharris@onShore.com...<URL:http://www.onShore.com/>